Abstract. CrocoPat is an efficient, powerful and easy-to-use tool for manipulating relations of 
arbitrary arity, including directed graphs. This manual provides an introduction to and a reference 
for CrocoPat and its programming language RML. It includes several application examples, in 
particular from the analysis of structural models of software systems. 

1 Introduction 

CrocoPat is a tool for manipulating relations, including directed graphs (binary relations). CrocoPat is 
powerful, because it manipulates relations of arbitrary arity; it is efficient in terms of time and memory, 
because it uses the data structure binary decision diagram (BDD, [Bry86, Bry92]) for the internal
representation of relations; it is fairly easy to use, because its language is simple and based on the well-known
predicate calculus; and it is easy to integrate with other tools, because it uses the simple and popular 
Rigi Standard Format (RSF) as input and output format for relations. CrocoPat is free software (released 
under LGPL) and can be obtained from http://www.software-systemtechnik.de/CrocoPat 

Overview. CrocoPat is a command line tool which interprets programs written in the Relation 
Manipulation Language (RML). Its inputs are an RML program and relations in the Rigi Standard 
Format (RSF), and its outputs are relations in RSF and other text produced by the RML program. The 
programming language RML and the input and output of relations from and to RSF files are introduced
with the help of many examples in Section 2. Section 3 describes advanced programming techniques, in
particular for improving the performance of RML programs and for circumventing limitations of RML.
The manual concludes with references for CrocoPat's command line options in Section 4, for RSF in
Section 5, and for RML in Section 6. The RML reference includes a concise informal description of the
semantics, and a formal description of the syntax and the core semantics.

Applications. CrocoPat was originally developed for analyzing graph models of software systems, 
and in particular for finding patterns in such graphs [BNL03]. Existing tools were not appropriate
for this task, because they were limited to binary relations (e.g. Grok [Hol98], RPA [FKvO98], and
RelView [BLM02]), or consumed too much time or memory (e.g. relational database management systems
and Prolog interpreters). Applications of graph pattern detection include 

- the detection of implementation patterns [RW90, HN90, Har91, Qui94], object-oriented design patterns
  (Section 2.4, [MS95, KP96, AFC98, KSRP99, NSW+02]), and architectural styles [Hol96],

- the detection of potential design problems (Section 2.4, [MS95, SSC96, FKvO98, KB98, Ciu99, FH00]),

- the inductive inference of design patterns [SMB96, TA99],

- the identification of code clones [Kri01],

- the extraction of scenarios from models of source code [WHH02], and

- the detection of design problems in databases [Bla04].

The computation of transitive closures of graphs - another particular strength of CrocoPat - is 
not only needed for the detection of some of the above patterns, but has also been applied for dead 
code detection and change impact analysis [CGK98, FKvO98]. Computing and analyzing the difference
between two graphs supports checking the conformance of the as-built architecture to the as-designed
architecture [SSC96, FKvO98, MWD99, FH00, MNS01], and studying the evolution of software systems
between different versions. Calculators for relations have also been used to compute views of systems on
different levels of abstraction by lifting and lowering relations [FKvO98, FH00], and to calculate software
metrics (Section 2.5, [MS95, KW99]).

Although we are most familiar with potential applications in the analysis of software designs, we are 
confident that CrocoPat can be beneficial in many other areas. For example, calculators for relations 
have been used for program analyses like points-to analysis [BLQ+03], and for the implementation of
graph algorithms (Section 2.3, [BBS97]).



2 RML Tutorial 



This section introduces RML, the programming language of CrocoPat, by example. The core of RML
consists of relational expressions based on first-order predicate calculus, a well-known, reasonably simple,
precise and powerful language. Relational expressions are explained in Subsection 2.1, and additional examples
involving relations of arity greater than two are given in Subsection 2.4. Besides relational expressions, the
language includes control structures, described in Subsection 2.3, and numerical expressions, described
in Subsection 2.5. The input and output of relations is described in Subsection 2.2. A more concise and
more formal specification of the language can be found in Section 6.

Although the main purpose of this section is the introduction of the language, some of the application
examples may be of interest by themselves. In Subsection 2.3, simple graph algorithms are implemented,
and in Subsections 2.4 and 2.5 the design of object-oriented software systems is analyzed.



2.1 Relational Expressions 

This subsection introduces relational expressions using relationships between people as example. Re- 
member that n-ary relations are sets of ordered n-tuples. In this subsection, we will only consider the 
cases n = 1 (unary relations) and n = 2 (binary relations, directed graphs). CrocoPat manipulates tuples
of strings, thus unary relations in CrocoPat are sets of strings, and binary relations in CrocoPat are sets 
of ordered pairs of strings. 

Adding Elements. The statement 

Male("John");

expresses that John is male. (In some languages, e.g. the logic programming language Prolog [CM03],
such statements are called facts.) It adds the string John to the unary relation Male. Because each 
relation variable initially contains the empty relation, John is so far the only element of the set Male. 
An explicit declaration of variables is not necessary. However, variables should be defined (i.e., assigned 
a value) before they are first used, otherwise CrocoPat prints a warning. 

Male("Joe");

adds the string Joe to the set Male, such that it now has two elements. Similarly, we can initialize the
variable Female:

Female("Alice");
Female("Jane");
Female("Mary");

To express that John and Mary are the parents of Alice and Joe, and Joe is the father of Jane,
we create a binary relation variable ParentOf which contains the five parent-child pairs:

ParentOf("John", "Alice");
ParentOf("John", "Joe");
ParentOf("Mary", "Alice");
ParentOf("Mary", "Joe");
ParentOf("Joe", "Jane");

Assignments. The following statement uses an attribute x to assign the set of Joe's parents to the 
set JoesParent: 

JoesParent(x) := ParentOf(x, "Joe");

Now JoesParent contains the two elements John and Mary. As another example, the following assignment
says that x is a child of y if and only if y is a parent of x:

ChildOf(x,y) := ParentOf(y,x);

John is the father of a person if and only if he is the parent of this person. The same is true for Joe:

FatherOf("John", x) := ParentOf("John", x);
FatherOf("Joe", x) := ParentOf("Joe", x);

Because the scope of each attribute is limited to one statement, the attribute in the first statement and
the attribute in the second statement are different, even though both are named x.

Basic Relational Operators. The relation FatherOf can be described more concisely: x is father 
of y if and only if x is a parent of y and x is male: 

FatherOf(x,y) := ParentOf(x,y) & Male(x);

Of course, we can define a similar relation for female parents:

MotherOf(x,y) := ParentOf(x,y) & Female(x);

Besides the operator and (&), another important operator is or (|). For example, we can define the
ParentOf relation in terms of the relations MotherOf and FatherOf: x is a parent of y if and only if x is
the mother or the father of y:

ParentOf(x,y) := MotherOf(x,y) | FatherOf(x,y);

Quantification. Parents are people who are a parent of another person. More precisely, x is a parent 
if and only if there exists (EX) a y such that x is a parent of y. 

Parent(x) := EX(y, ParentOf(x,y));

Now the set Parent consists of John, Mary, and Joe. There is also an abbreviated notation for existential
quantification which is similar to anonymous variables in Prolog and functional programming languages:

Parent(x) := ParentOf(x,_);

With the operator not (!), we can compute who has no children:

Childless(x) := !EX(y, ParentOf(x,y));

Equivalently, x is childless if, for all (FA) y, x is not a parent of y:

Childless(x) := FA(y, !ParentOf(x,y));

In both cases, the set Childless contains Alice and Jane. 
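
The statements of this subsection so far can be mimicked with ordinary sets of tuples. The following Python sketch is not part of CrocoPat and says nothing about its BDD-based implementation; it only models the semantics described above, under the assumption that the universe consists of exactly the five names used in the examples. All Python names are chosen for illustration.

```python
# Relations modelled as Python sets: unary relations hold strings,
# binary relations hold pairs of strings.
male = {"John", "Joe"}
female = {"Alice", "Jane", "Mary"}
parent_of = {("John", "Alice"), ("John", "Joe"),
             ("Mary", "Alice"), ("Mary", "Joe"), ("Joe", "Jane")}

# Assumed universe: all strings that occur in the statements above.
universe = male | female

# FatherOf(x,y) := ParentOf(x,y) & Male(x);
father_of = {(x, y) for (x, y) in parent_of if x in male}

# Parent(x) := EX(y, ParentOf(x,y));
parent = {x for x in universe if any((x, y) in parent_of for y in universe)}

# Childless(x) := !EX(y, ParentOf(x,y));   negation is relative to the universe
childless_not_ex = universe - parent

# Childless(x) := FA(y, !ParentOf(x,y));
childless_fa = {x for x in universe
                if all((x, y) not in parent_of for y in universe)}

print(sorted(parent))
print(sorted(childless_fa))
```

Both formulations of Childless yield the same set in this model, which mirrors the equivalence of !EX and FA with a negated body.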

Transitive Closure. To compute the grandparents of a person we have to determine the parents of
his or her parents:

GrandparentOf(x,z) := EX(y, ParentOf(x,y) & ParentOf(y,z));

Now GrandparentOf contains the two pairs (John, Jane) and (Mary, Jane). To find out all ancestors of
a person, i.e. parents, parents of parents, parents of parents of parents, etc., we have to apply the above
operation (which is also called composition) repeatedly until the fixed point is reached, and unite the
results. The transitive closure operator TC does exactly this:

AncestorOf(x,z) := TC(ParentOf(x,z));

The resulting relation AncestorOf contains every pair from ParentOf and GrandparentOf. (It also contains
great-grandparents etc., but there are none in this example.) The transitive closure operator TC can only
be applied to binary relations.
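
The idea of composing until the fixed point is reached can be sketched with plain Python sets. This is an illustrative model only: the function names compose and transitive_closure are hypothetical, and the sketch does not reflect how CrocoPat implements TC internally.

```python
def compose(r, s):
    """Relational composition: all (x, z) with (x, y) in r and (y, z) in s."""
    return {(x, z) for (x, y1) in r for (y2, z) in s if y1 == y2}


def transitive_closure(r):
    """Unite r, r composed with r, and so on, until nothing new is added."""
    result = set(r)
    while True:
        extended = result | compose(result, r)
        if extended == result:      # fixed point reached
            return result
        result = extended


parent_of = {("John", "Alice"), ("John", "Joe"),
             ("Mary", "Alice"), ("Mary", "Joe"), ("Joe", "Jane")}

grandparent_of = compose(parent_of, parent_of)
ancestor_of = transitive_closure(parent_of)

print(sorted(grandparent_of))
print(len(ancestor_of))
```

In this example the closure is exactly the union of the parent and grandparent pairs, because the family tree has only three generations.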

Predefined Relations, the Universe. The relations FALSE and TRUE are predefined. FALSE is the
empty relation, and TRUE is the full relation. More precisely, there is one predefined relation FALSE and
one predefined relation TRUE for every arity. In particular, there is also a 0-ary relation FALSE(), which
is the empty set, and a 0-ary relation TRUE(), which contains only () (the tuple of length 0). Intuitively,
these 0-ary relations can be used like Boolean literals. By the way, the statement

Male("John");

is an abbreviation of the assignment

Male("John") := TRUE();

The result of TRUE(x) is the so-called universe. The universe contains all string literals that appear
in the input RSF stream (if there is one, see Subsection 2.2) and on the left hand side of assignments in
the present RML program. For example, the string literals used on the left hand side of the assignments
in the examples of this subsection are Alice, Jane, Joe, John, and Mary, so the set TRUE(x) contains
these five elements. See Subsection 3.4 for more information on the universe.

The binary relations =, !=, <, <=, >, and >= for the lexicographical order of the strings in the universe 
are also predefined. For example, siblings are two different people who have a common parent: 

SiblingOf(x,y) := EX(z, ParentOf(z,x) & ParentOf(z,y)) & !=(x,y);

The infix notation is also available for binary relations, so the expression !=(x,y) can also be written
as x != y. Note that the predefined relations, like any other relation, are restricted to the universe. Thus
the expression "A" = "A" yields FALSE() if (and only if) the string A is not in the universe.

Further relational expressions are provided to match POSIX extended regular expressions [IEE01,
Section 9.4]. These relational expressions start with the character @, followed by the string for the regular
expression. For example,

StartsWithJ(x) := @"^J"(x);

assigns to the set StartsWithJ the set of all strings in the universe that start with the letter J, namely
Jane, Joe, and John. A short overview of the syntax of regular expressions is given in Subsection 6.2.

Boolean Operators. Two relations can be compared with the operators =, !=, <, <=, >, or >=.
Because such comparisons evaluate to either TRUE() or FALSE(), they are called Boolean expressions.
For example,

GrandparentOf(x,y) < AncestorOf(x,y)

yields TRUE(), because GrandparentOf is a proper subset of AncestorOf. However,

GrandparentOf(x,y) = AncestorOf(x,y)

yields FALSE(), because the two relations are not equal.

The six comparison operators should not be confused with the six predefined relations for the
lexicographical order. The operators take two relations as parameters, while the predefined relations take
strings or attributes as parameters.

2.2 Input and Output of Relations

File Format RSF. CrocoPat reads and writes relations in Rigi Standard Format (RSF, [Won98,
Section 4.7.1]). Files in RSF are human-readable, can be loaded into and saved from many reverse engineering
tools, and are easily processed by scripts in common scripting languages.

In an RSF file, a tuple of an n-ary relation is represented as a line of the form

RelationName element1 element2 ... elementn

The elements may be enclosed by double quotes. Because white space serves as delimiter of the elements, 
elements that contain white space must be enclosed by double quotes. A relation is represented as a 
sequence of such lines. The order of the lines is arbitrary. An RSF file may contain several relations. 

As an example, the relation ParentOf from the previous subsection can be represented in RSF format 
as follows: 

ParentOf John Alice 
ParentOf John Joe 
ParentOf Mary Alice 
ParentOf Mary Joe 
ParentOf Joe Jane 

Input. RML has no input statements. When CrocoPat is started, it first reads input relations in
RSF from the standard input before it parses and executes the RML program. RSF reading can be
skipped with the -e command line option. If the input relations are available as files, they can be fed
into CrocoPat's standard input using the shell operator <, as the following example shows for the file
ParentOf.rsf:

crocopat Prog.rml < ParentOf.rsf

The end of the input data is recognized either from the end-of-file character or from a line that starts
with the dot (.) character. The latter is sometimes useful if RSF input is fed interactively.

If the above RSF data is used as input, then at the start of the program the binary relation variable
ParentOf contains the five pairs, and the universe contains the five string literals Alice, Jane, Joe,
John, and Mary (and additionally all string literals that appear on the left hand side of assignments in
the program).
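
Because RSF is line-oriented, it is easy to process with a short script. The following Python sketch reads RSF text into sets of tuples and derives the universe contributed by the input. It is an illustration under stated assumptions, not CrocoPat's reader: the function name parse_rsf is hypothetical, and quoting is delegated to Python's shlex module, whose rules (for example for backslashes and single quotes) may differ from CrocoPat's.

```python
import shlex
from collections import defaultdict


def parse_rsf(text):
    """Read RSF lines into a dict mapping relation names to sets of tuples."""
    relations = defaultdict(set)
    for line in text.splitlines():
        if line.startswith("."):        # a line starting with a dot ends the input
            break
        fields = shlex.split(line)      # white space delimits, double quotes group
        if fields:                      # skip empty lines
            relations[fields[0]].add(tuple(fields[1:]))
    return dict(relations)


sample = """ParentOf John Alice
ParentOf John Joe
ParentOf Mary Alice
ParentOf Mary Joe
ParentOf Joe Jane
"""

relations = parse_rsf(sample)

# The part of the universe that stems from the input: all tuple elements.
universe = {element
            for tuples in relations.values()
            for tuple_ in tuples
            for element in tuple_}

print(sorted(universe))
```

Elements that contain white space survive as single strings as long as they are enclosed in double quotes, which matches the delimiter rule described above.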

Output. The PRINT statement outputs relations in RSF format to the standard output. For example, 
running the program 

ParentOf("Joe",x) := FALSE(x);
ParentOf(x,"Joe") := FALSE(x);
PRINT ParentOf(x,y);

with the above input data prints to the standard output 

John Alice 
Mary Alice 

The statement 

PRINT ["ParentOf"] ParentOf(x,y);

writes the string ParentOf before each tuple, and thus outputs 

ParentOf John Alice 
ParentOf Mary Alice 

The output can also be appended to a file ParentOf2.rsf (which is created if it does not exist) with

PRINT ["ParentOf"] ParentOf(x,y) TO "ParentOf2.rsf";

or to stderr with

PRINT ["ParentOf"] ParentOf(x,y) TO STDERR;






Command Line Arguments. It is sometimes convenient to specify the names of output files at
the command line and not in the RML program. If there is only one output file, the standard output
can be simply redirected to a file using the shell operator >:

crocopat Prog.rml < ParentOf.rsf > ParentOf2.rsf

An alternative solution (which also works with more than one file) is to pass command line arguments
to the program. Command line arguments can be accessed in RML as $1, $2, etc. For example, when
the program

ChildOf(x,y) := ParentOf(y,x);
PRINT ["Child"] ChildOf(x,$1) TO $1 + ".rsf";
PRINT ["Child"] ChildOf(x,$2) TO $2 + ".rsf";

is executed with

crocopat lO.rml Joe Mary < ParentOf.rsf

then the first PRINT statement writes to the file Joe.rsf, and the second PRINT statement writes to
Mary.rsf.

Command line arguments are not restricted to specifying file names, but can be used like string
literals. However, in contrast to string literals, command line arguments are never added to the universe,
and thus cannot be used on the left hand side of relational assignments.

2.3 Control Structures 

This subsection introduces the control structures of RML, using algorithms for computing the transitive 
closure of a binary relation R as examples. 

WHILE Statement. As a first algorithm, the relation R is composed with itself until the fixed point 
is reached. 

Result(x,y) := R(x,y);
PrevResult(x,y) := FALSE(x,y);
WHILE (PrevResult(x,y) != Result(x,y)) {
  PrevResult(x,y) := Result(x,y);
  Result(x,z) := Result(x,z) | EX(y, Result(x,y) & Result(y,z));
}

The program illustrates the use of the WHILE loop, which has the usual meaning: The body of the loop
is executed repeatedly as long as the condition after WHILE evaluates to TRUE().

FOR Statement. The second program computes the transitive closure of the relation R using the
Warshall algorithm. This algorithm successively adds arcs. In the first iteration, an arc (u, v) is added
if the input graph contains the arcs (u, node_0) and (node_0, v). In the second iteration, an arc (u, v) is
added if the graph that results from the first iteration contains the arcs (u, node_1) and (node_1, v). And
so on, for all nodes of the graph (in arbitrary order).

Result(x,y) := R(x,y);
Node(x) := Result(x,_) & Result(_,x);
FOR node IN Node(x) {
  Result(x,y) := Result(x,y) | (Result(x,node) & Result(node,y));
}

The program illustrates the use of the FOR loop. The relation after IN must be a unary relation. The iter- 
ator after FOR is a string variable and takes as values the elements of the unary relation in lexicographical 
order. Thus, the number of iterations equals the number of elements of the unary relation. 
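
For readers who prefer to trace the Warshall program step by step, here is a set-based Python sketch of the same loop. It models the RML program above and is not CrocoPat's implementation of TC; the function name warshall_closure is hypothetical, and sorted() is used only to imitate the lexicographical iteration order of FOR.

```python
def warshall_closure(r):
    """Transitive closure of a binary relation, one intermediate node at a time."""
    result = set(r)
    # Node(x) := Result(x,_) & Result(_,x);
    # Only nodes with both an incoming and an outgoing arc can be intermediate nodes.
    nodes = {x for (x, _) in result} & {y for (_, y) in result}
    for node in sorted(nodes):
        preds = {x for (x, y) in result if y == node}   # Result(x,node)
        succs = {y for (x, y) in result if x == node}   # Result(node,y)
        # Result(x,y) := Result(x,y) | (Result(x,node) & Result(node,y));
        result |= {(x, y) for x in preds for y in succs}
    return result


chain = {("a", "b"), ("b", "c"), ("c", "d")}
closure = warshall_closure(chain)
print(sorted(closure))
```

For the chain a, b, c, d only b and c qualify as intermediate nodes, so the loop body runs twice.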

For the implementation of the transitive closure operator of RML, we experimented with several
algorithms. An interesting observation in these experiments was that the empirical complexity of some
algorithms for practical graphs deviated strongly from their theoretical worst case complexity, thus
some algorithms with a relatively bad worst-case complexity were very competitive in practice. In our
experiments, the first of the above algorithms was very fast, thus we made it available as operator TCFAST.
The implementation of the TC operator of RML is a variant of the Warshall algorithm. It is somewhat
slower than TCFAST (typically about 20 percent in our experiments), but often needs much less memory
because it uses no ternary relations.

IF Statement. The following example program determines if the input graph R is acyclic, by checking 
if its transitive closure contains loops (i.e. arcs from a node to itself): 

SelfArcs(x,y) := TC(R(x,y)) & (x = y);
IF (SelfArcs(_,_)) {
  PRINT "R is not acyclic", ENDL;
} ELSE {
  PRINT "R is acyclic", ENDL;
}



2.4 Relations of Higher Arity 

In this subsection, relations of arity greater than two are used for finding potential design patterns and 
design problems in structural models of object-oriented programs. The examples are taken from [BNL03].

The models of object-oriented programs contain the call, containment, and inheritance relationships 
between classes. Here containment means that a class has an attribute whose type is another class. The 
direction of inheritance relationships is from the subclass to the superclass. As an example, the source 
code 

class ContainedClass {}
class Superclass {}
class SubClass extends Superclass {
  ContainedClass c;
}

corresponds to the following RSF file: 

Inherit SubClass Superclass 
Contain SubClass ContainedClass 



[Figure: class diagram with the classes Component, Leaf, and Composite]

Fig. 1. Composite design pattern



Composite Design Pattern. Figure 1 shows the class diagram of the Composite design pattern
[GHJV95]. To identify possible instances of this pattern, we compute all triples of a Component
class, a Composite class, and a Leaf class, such that (1) Composite and Leaf are subclasses of Component,
(2) Composite contains an instance of Component, and (3) Leaf does not contain an instance of
Component. The translation of these conditions to an RML statement is straightforward:

CompPat(component, composite, leaf) := Inherit(composite, component)
                                     & Contain(composite, component)
                                     & Inherit(leaf, component)
                                     & !Contain(leaf, component);

Degenerate Inheritance. When a class C inherits from another class A directly and indirectly via a 
class B, the direct inheritance is probably redundant or even misleading. The following statement detects 
such patterns: 

DegInh(a,b,c) := Inherit(c,b)
               & Inherit(c,a)
               & TC(Inherit(b,a));
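
To see which triples the degenerate inheritance statement selects, the following Python sketch evaluates the same condition on a small, made-up inheritance graph. The class names A, B, C, D and the helper transitive_closure are assumptions for illustration; the sketch models the semantics with sets and is unrelated to CrocoPat's internals.

```python
def transitive_closure(r):
    """Smallest transitive relation that contains r (naive fixed-point iteration)."""
    result = set(r)
    while True:
        step = {(x, z) for (x, y1) in result for (y2, z) in r if y1 == y2}
        if step <= result:
            return result
        result |= step


# Hypothetical model: C inherits from B and A, B inherits from A, D inherits from A.
inherit = {("C", "B"), ("B", "A"), ("C", "A"), ("D", "A")}
inherit_tc = transitive_closure(inherit)

# DegInh(a,b,c) := Inherit(c,b) & Inherit(c,a) & TC(Inherit(b,a));
deg_inh = {(a, b, c)
           for (c, b) in inherit
           for (c2, a) in inherit
           if c2 == c and (b, a) in inherit_tc}

print(sorted(deg_inh))
```

Only the direct edge from C to A is flagged, because C also reaches A via B; the edge from D to A has no indirect counterpart.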

Cycles. To understand an undocumented class, one has to understand all classes it uses. If one of
the (directly or indirectly) used classes is the class itself, understanding this class is difficult. All classes
that participate in cycles can be found using the transitive closure operator, as shown in Subsection 2.3.
However, in many large software systems hundreds of classes participate in cycles, and it is tedious for
a human analyst to find the actual cycles in the list of these classes. In our experience, it is often more
useful to detect cycles in the order of ascending length. As a part of such a program, the following
statements detect all cycles of length 3.

Use(x,y) := Call(x,y) | Contain(x,y) | Inherit(x,y);
Cycle3(x,y,z) := Use(x,y) & Use(y,z) & Use(z,x);
Cycle3(x,y,z) := Cycle3(x,y,z) & (x <= y) & (x <= z);

To see the purpose of the third statement, consider three classes A, B, and C that form a cycle. After
the second statement, the relation variable Cycle3 contains three representatives of this cycle: (A, B, C),
(B, C, A) and (C, A, B). The third statement removes two of these representatives from Cycle3, and keeps
only the tuple with the lexicographically smallest class at the first position, namely (A, B, C).
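
The effect of the third statement can be checked on a tiny example. The Python sketch below uses made-up Call, Contain, and Inherit relations (an assumption for illustration) and models the three RML statements with set comprehensions; Python's string comparison plays the role of the lexicographical order.

```python
# Hypothetical input relations: A calls B, B contains C, C inherits from A.
call = {("A", "B")}
contain = {("B", "C")}
inherit = {("C", "A"), ("A", "D")}

# Use(x,y) := Call(x,y) | Contain(x,y) | Inherit(x,y);
use = call | contain | inherit

# Cycle3(x,y,z) := Use(x,y) & Use(y,z) & Use(z,x);
cycle3_all = {(x, y, z)
              for (x, y) in use
              for (y2, z) in use
              if y2 == y and (z, x) in use}

# Cycle3(x,y,z) := Cycle3(x,y,z) & (x <= y) & (x <= z);
cycle3 = {(x, y, z) for (x, y, z) in cycle3_all if x <= y and x <= z}

print(sorted(cycle3_all))
print(sorted(cycle3))
```

The unfiltered set holds the three rotations of the cycle, and the filter keeps the one that starts with the smallest class name.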

2.5 Numerical Expressions 

In this subsection, a software metric is calculated as an example of the use of numerical expressions in
RML programs. To this end, we extend the structural model of object-oriented programs introduced in
the previous subsection with a binary relation PackageOf. This relation assigns to each package the
classes that it contains. (Packages are high-level entities in object-oriented software systems that can be
considered as sets of classes.)

Robert Martin's metric for the instability of a package is defined as ce/(ca + ce), where ca is the
number of classes outside the package that use classes inside the package, and ce is the number of classes
inside the package that use classes outside the package [Mar97].

Use(x,y) := Call(x,y) | Contain(x,y) | Inherit(x,y);
Package(x) := PackageOf(x,_);
FOR p IN Package(x) {
  CaClass(x) := !PackageOf(p,x) & EX(y, Use(x,y) & PackageOf(p,y));
  ca := #(CaClass(x));
  CeClass(x) := PackageOf(p,x) & EX(y, Use(x,y) & !PackageOf(p,y));
  ce := #(CeClass(x));
  IF (ca + ce > 0) {
    PRINT p, " ", ce / (ca+ce), ENDL;
  }
}
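
A worked example may help to check the arithmetic of the metric. The Python sketch below computes ca, ce, and the instability for a made-up model with two packages; the package and class names and the function instability are assumptions for illustration, and the sketch does not claim anything about how RML formats or rounds the printed quotient.

```python
# Hypothetical model: package p1 holds A and B, package p2 holds C and D.
package_of = {("p1", "A"), ("p1", "B"), ("p2", "C"), ("p2", "D")}
use = {("A", "C"), ("B", "C"), ("C", "D"), ("D", "A")}


def instability(package):
    """Return (ca, ce, ce / (ca + ce)); the quotient is None if ca + ce is 0."""
    inside = {c for (p, c) in package_of if p == package}
    # CaClass: classes outside the package that use a class inside it.
    ca_classes = {x for (x, y) in use if x not in inside and y in inside}
    # CeClass: classes inside the package that use a class outside it.
    ce_classes = {x for (x, y) in use if x in inside and y not in inside}
    ca, ce = len(ca_classes), len(ce_classes)
    return ca, ce, (ce / (ca + ce) if ca + ce > 0 else None)


for package in sorted({p for (p, _) in package_of}):
    print(package, instability(package))
```

In this model p1 is used by one outside class (D) and uses outside classes from two of its own classes (A and B), so its instability is 2/3; for p2 the roles are reversed and the value is 1/3.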






3 Advanced Programming Techniques 

This section describes advanced programming techniques, in particular for improving efficiency and 
circumventing language limitations. The first subsection explains how to control the memory usage of 
CrocoPat. The second and third subsections describe how relational expressions are evaluated in CrocoPat,
and how to assess and improve the efficiency of their evaluation. The fourth subsection explains why the 
universe is immutable during the execution of an RML program and how to work around this limitation. 

3.1 Controlling the Memory Usage 

CrocoPat represents relations using the data structure binary decision diagram (BDD, [Bry86]). When
CrocoPat is started, it reserves a fixed amount of memory for BDDs, which is not changed during the
execution of the RML program. If the available memory is insufficient, CrocoPat exits with the error
message

Error: BDD package out of memory.

The BDD memory can be controlled with the command line option -m, followed by an integer number
giving the amount of memory in MByte. The default value is 50. The actual amount of memory reserved
for BDDs cannot be set to arbitrary values, so the specified value is only a rough upper bound of the
amount of memory used.

It can also be beneficial to reserve less memory, because the time used for allocating memory in- 
creases with the amount of memory. When the manipulated relations are small or the algorithms are 
computationally inexpensive, memory allocation can dominate the overall runtime. 

3.2 Speeding up the Evaluation of Relational Expressions 

This subsection explains how CrocoPat evaluates relational expressions. Based on this information, hints 
for performance improvement are given. Understanding the subsection requires basic knowledge about 
BDDs and the impact of the variable order on the size of BDDs. An introduction to BDDs is beyond
the scope of this manual; we refer the reader to [Bry92].

The attributes in an RML program are called user attributes in the following. For example, the
expression R(x,y) contains the user attributes x and y. For the internal representation of relations,
CrocoPat uses a sequence of internal attributes, which are distinct from the user attributes. We call
these internal attributes i1, i2, i3, ... For example, the binary relation R is internally represented as a
set of assignments to the internal attributes i1 and i2.

When the expression R(x,y) is evaluated, the internal attributes i1 and i2 are renamed to the user
attributes x and y. For this renaming, all BDD nodes of the representation of R have to be traversed. Thus,
the time for evaluating the expression R(x,y) is at least linear in the number of BDD nodes of R's
representation.

The order of the internal attributes in the BDD is always i1, i2, ... The order of the user attributes
in the BDD may be different in the evaluation of each statement, because the scope of user attributes is 
restricted to one statement. The order of the user attributes in the BDD in the evaluation of a statement 
is the order in which CrocoPat encounters the user attributes in the execution of the statement. In the 
example statement 

R(x,z) := EX(y, R(x,y) & R(y,z)); 

CrocoPat evaluates first R(x,y), then R(y,z), then the conjunction, then the existential quantification, 
and finally the assignment. Therefore, the order of the user attributes in the BDD is x, y, z. 

Avoid renaming large relations. The time for the evaluation of the expression R(x,y) is at least
linear in the number of BDD nodes in the representation of R, because all BDD nodes have to be renamed
from internal attributes to user attributes. Usually this effort for renaming does not dominate the overall
runtime, but in the following we give an example where it does.

Let R(x,y) be a directed graph with n nodes. Let the BDD representation of R have O(n^2) BDD
nodes (which is the worst case). The assignment

Outneighbor(y) := EX(x, R(x,y) & x="node1");

assigns the outneighbors of the graph node node1 to the set Outneighbor. The evaluation of R(x,y) costs
Θ(n^2) time in this example, because of the renaming of all nodes. The "real computation", namely the
conjunction and the existential quantification, can be done in O(log n) time. So the renaming dominates
the overall time.

The equivalent statement

Outneighbor(y) := R("node1",y);

is executed in only O(n) time, because the set R("node1",y) has O(n) elements, and its BDD
representation has O(n) nodes.

Avoid swapping attributes. Renaming the nodes of a BDD costs at least linear time, but can be
much more expensive when attributes have to be swapped. In the statement

S(x,y,z) := R(y,z) & R(x,y) ; 

the BDD attribute order on the right hand side of the assignment is y, z, x, while the BDD attribute 
order on the left hand side is x, y, z. Because the two orders are different, attributes have to be swapped 
to execute the assignment. This can be easily avoided by using the equivalent statement 

S(x,y,z) := R(x,y) & R(y,z) ; 

Of course, swapping attributes cannot always be avoided. However, developers of RML programs
should know that swapping attributes can be expensive, and should minimize it when performance is 
critical. 

Ensure good attribute orders. A detailed discussion of BDD attribute orders is beyond the scope 
of this manual (see e.g. [Bry92, Section 1.3] for details), but the basic rule is that related attributes
should be grouped together. In the two assignment statements 

S1(v,w,x,y) := R(v,w) & R(x,y);
S2(v,x,w,y) := R(v,w) & R(x,y);

the attributes v and w are related, and the attributes x and y are related, while v and w are unrelated to
x and y. In S1, related attributes are grouped together, but not in S2. For many relations R, the BDD
representation of S1 will be drastically smaller than the BDD representation of S2.

Profile. Information about the number of BDD nodes and the BDD attribute order of an expression 
can be printed with PRINT RELINFO. For example, 

PRINT RELINFO(R(y,z) & R(x,y));

may output

Number of tuples in the relation: 461705
Number of values (universe): 6218
Number of BDD nodes: 246986
Percentage of free nodes in BDD package: 1614430 / 1966102 = 82 %
Attribute order: y z x

The first line gives the cardinality of the relation, the second line the cardinality of the universe, the
third line the size of the BDD that represents the result of the expression, and the fifth line the attribute
order in this BDD.

3.3 Estimating the Evaluation Time of Relational Expressions

Knowledge of the computational complexity of RML's operators is useful to optimize the performance 
of RML programs. This subsection gives theoretical complexity results, but also discusses the limits of 
their practical application. 

Table 1 shows the asymptotic worst case time complexity for the evaluation of RML's relational
operators. The times do not include the renaming of internal attributes discussed in the previous subsection,
and the evaluation of subexpressions. It is assumed that the caches of the BDD package are sufficiently
large. This assumption is closely approximated in practice when the manipulated BDDs only occupy a
small fraction of the available nodes in the BDD package. Otherwise, performance may be improved by
increasing the BDD memory (see Subsection 3.1).

When the operands of an expression are relations, the computation time is given as function of 
the sizes of their BDD representations. (The only exceptions are the transitive closure operators, where
a function of the size of the universe gives a more useful bound.) This raises the problem of how to 
estimate these BDD sizes. Many practical relations have regularities that enable an (often dramatically) 
compressed BDD representation, but the analytical derivation of the typical compression rate for relations 
from a particular application domain is generally difficult. Our advice is to choose some representative 
examples and measure the BDD sizes with the PRINT RELINFO statement. 

It is important to note that Table 1 gives worst-case computation times. In many cases, the typical
practical performance is much better than the worst case. For example, the relational comparison
operators (<=, <, >=, >) and the binary logic operators (&, |, ->, <->) are very common in RML programs.
Their worst-case complexity is the product of the sizes of their operand BDDs, which is alarmingly high. 
However, in practice the performance is often much closer to the sum of the operand BDD sizes. Simi- 
larly, the quantification operators are often efficient despite their prohibitive worst case runtime (which 
is difficult to derive because quantification is implemented as a series of several bit-level operations). 

Another practically important example of the gap between average-case and worst-case runtime is
provided by the transitive closure operators. The worst-case complexity of their BDD-based
implementation is the same as for implementations with conventional data structures. However, the
BDD-based implementations are much more efficient for many practical graphs [BNL03]. Even in the
comparison of different BDD-based implementations, a better worst-case complexity does not imply a
better performance in practice. We conclude from our experience that knowledge of the theoretical
complexity complements but cannot replace experimentation in the development of highly optimized RML
programs.



Table 1. Worst-case time complexity of the evaluation of relational expressions. re, re1, re2 are
relational expressions, x, x1, x2 are attributes, and ne1, ne2 are numerical expressions. bddsize(re) is
the number of BDD nodes of the result of the expression re, and n is the cardinality of the universe.

    ! re
    re1 & re2, re1 | re2
    re1 -> re2, re1 <-> re2
    EX(x, re), FA(x, re)
    TC(re)
    TCFAST(re)
    FALSE(x1, x2, ...)
    TRUE(x1, x2, ...)
    @s(x)
    ~(x1, x2)
    ~(ne1, ne2)
    re1 = re2, re1 != re2
    re1 < re2, re1 <= re2
    re1 > re2, re1 >= re2
3.4 Extending the Universe



The set of all strings that may be tuple elements of relations in an RML program is called the universe. 
The universe contains all tuple elements of the input relations (from the input RSF data), and all string 
literals that appear on the left hand side of assignments in the RML program. The universe is immutable 
in the sense that it can be determined before the interpretation of the RML program starts, and is not 
changed during the interpretation of the RML program. 

Sometimes the immutability of the universe is inconvenient for the developer of RML programs.
Consider, for example, a program that takes as input the nodes and arcs of a graph, and computes the
binary relation OutneighborCnt, which contains for each node the number of its outneighbors:

    FOR n IN Node(x) {
        OutneighborCnt(n, #(Arc(n,x))) := TRUE();
    }

This is not a syntactically correct RML program, because #(Arc(n,x)) is not a string, but a number.
RML does have an operator STRING that converts a number into a string, but

    OutneighborCnt(n, STRING( #(Arc(n,x)) )) := TRUE();

is still not syntactically correct, because such a conversion is not allowed on the left-hand side of
assignment statements. The reason is that the string that results from such a conversion is generally
not known before the execution of the RML program. It can therefore not be added to the universe
before the execution, and is thus not allowed as a tuple element of a relation.

The immutability of the universe during the execution of an RML program is necessary because 
constant relations like TRUE(x) (the universe) and =(x,y) (string equality for all strings in the universe) 
are only defined for a given universe. Also, the complement of a relation depends on the universe: The 
complement of a set contains all strings of the universe that are not in the given set, and thus clearly 
changes when the universe changes. 

However, there is a way to work around this limitation: writing an RSF file, and restarting CrocoPat
with this RSF file as input, which adds all tuple elements in the RSF file to the universe. For example,
the above incorrect program can be replaced by the following correct program:

    FOR n IN Node(x) {
        PRINT "OutneighborCnt ", n, " ", #(Arc(n,x)), ENDL TO "OutneighborCnt.rsf";
    }

When CrocoPat is restarted with the resulting RSF file OutneighborCnt.rsf as input, the binary
relation OutneighborCnt is available for further processing.
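
A second program, run on the generated file, can then treat the counts as ordinary tuple elements.
The following sketch collects all nodes with at least 10 outneighbors. The relation name Hub, the
threshold 10, and the attribute names v and k are assumptions chosen for illustration.

```rml
// Second run, reading OutneighborCnt.rsf from stdin.
Hub(v) := FALSE(v);
FOR c IN EX(v, OutneighborCnt(v, k)) {
    // c is a string of the universe; NUMBER converts it back into a number.
    IF (NUMBER(c) >= 10) {
        Hub(v) := Hub(v) | OutneighborCnt(v, c);
    }
}
PRINT Hub(v);
```

The loop iterates over the count strings rather than over the nodes, so the comparison is evaluated
once per distinct count.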

4 CrocoPat Reference 

CrocoPat is executed with 

    crocopat [OPTION]... FILE [ARGUMENT]...

It first reads relations in RSF (see Section 5) from stdin (unless the option -e is given) and then
executes the RML program FILE (see Section 6). The ARGUMENTs are passed to the RML program. The
OPTIONs are

    -e          Do not read RSF data from stdin.
    -m NUMBER   Approximate memory for BDD package in MB. The default is 50. See Subsection 3.1.
    -q          Suppress warnings.
    -h          Display help message and exit.
    -V          Print version information and exit.
The output of the RML program can be written to files, stdout, or stderr, as specified in the RML
program. Error messages and warnings of CrocoPat are always written to stderr.

The exit status of CrocoPat is 1 if it terminates abnormally and 0 otherwise. CrocoPat always outputs
an error message to stderr before it terminates with exit status 1.
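
The following sketch shows how an RML program might check its ARGUMENTs and set its own exit status.
The file name args.rml and the exit status 2 (chosen to differ from the status 1 that CrocoPat itself
uses for abnormal termination) are assumptions for illustration.

```rml
// Assumed invocation: crocopat -e args.rml SomeName
IF (argCount < 1) {
    PRINT "expected one argument", ENDL TO STDERR;
    EXIT 2;
}
PRINT "first argument: ", $1, ENDL;
EXIT 0;
```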

5 RSF Reference 

Rigi Standard Format (RSF) is CrocoPat's input and output format for relations. It is an extension of the
format for binary relations defined in [Won98, Section 4.7.1]. For examples of its use see Subsection 2.2.
An RSF stream is a sequence of lines. The order of the lines is arbitrary. The repeated occurrence
of a line is permissible and has the same meaning as a single occurrence. The end of an RSF stream is
indicated by the end of the file or by a line that starts with a dot (.). Lines starting with a sharp (#)
are comment lines.

All lines that do not start with a dot or a sharp specify a tuple in a relation. They consist of the 
name of the relation followed by a sequence of (an arbitrary number of) tuple elements, separated by at 
least one whitespace character (i.e., space or horizontal tab). 

Relation names must be RML identifiers (see Subsection 6.1). Tuple elements are sequences of
arbitrary characters except line breaks and whitespace characters. A tuple element may optionally be
enclosed by double quotes ("), in which case it may also contain whitespace characters. Tuple elements
that are enclosed by double quotes in the RSF input of an RML program are also enclosed by double
quotes in its output.

6 RML Reference 

Relation Manipulation Language (RML) is CrocoPat's programming language for manipulating relations. 
This section defines the lexical structure, the syntax, and the semantics of RML. Nonterminals are printed 
in italics and terminals in typewriter. 

6.1 Lexical Structure 

Identifiers are sequences of Latin letters (a-zA-Z), digits (0-9) and underscores (_), the first of which
must be a letter or underscore. RML has four types of identifiers: attributes (attribute), relational
variables (rel_var), string variables (str_var), and numerical variables (num_var). Every identifier of an
RML program belongs to exactly one of these types. The type is determined at the first occurrence of the
identifier in the input RSF file (only possible for relational variables) or in the RML program. Explicit
declaration of identifiers is not necessary.

The following strings are reserved as keywords and therefore cannot be used as identifiers: 

AVG DIV ELSE ENDL EX EXEC EXIT FA FOR IF IN MAX MIN MOD NUMBER PRINT RELINFO
STDERR STRING SUM TC TCFAST TO WHILE 

RML has two types of literals: string literals (str_literal) and numerical literals (num_literal). String
literals are delimited by double quotes (") and can contain arbitrary characters except double quotes.
A numerical literal consists of an integer part, a fractional part indicated by a decimal point (.), and
an exponent indicated by the letter e or E followed by an optionally signed integer. All three parts
are optional, but at least one digit in the integer part or the fractional part is required. Examples of
numerical literals are 1, .2, 3., 4.5, and 6e-7.

There are two kinds of comments: text starting with /* and ending with */, and text from // to the
end of the line.
6.2 Syntax and Informal Semantics



program ::=                                   RML program.
    stmt ...¹                                 Executes the stmts in the given order.

stmt ::=                                      Statement.
    rel_var(term, ...) := rel_expr;           Assigns the result of rel_expr to rel_var.
                                              →² The terms must be attributes or str_literals.
                                              → The set of attributes among the terms on the left-hand side
                                                must equal the set of free attributes in rel_expr.
    rel_var(term, ...);                       Shortcut for rel_var(term, ...) := TRUE(term, ...).
    str_var := str_expr;                      Assigns the result of str_expr to str_var.
    num_var := num_expr;                      Assigns the result of num_expr to num_var.
    IF rel_expr {stmt ...} ELSE {stmt ...}    Executes the stmts before ELSE if the result of rel_expr is TRUE(),
    IF rel_expr {stmt ...}                    and the stmts after ELSE (if present) otherwise.
                                              → rel_expr must not have free attributes.
    WHILE rel_expr {stmt ...}                 Executes the stmts repeatedly as long as rel_expr evaluates to TRUE().
                                              → rel_expr must not have free attributes.
    FOR str_var IN rel_expr {stmt ...}        Executes the stmts once for each element in the result of rel_expr.
                                              → rel_expr must have exactly one free attribute.
    PRINT print_expr, ...;                    Writes the results of the print_exprs to stdout.
    PRINT print_expr, ... TO STDERR;          Writes the results of the print_exprs to stderr.
    PRINT print_expr, ... TO str_expr;        Appends the results of the print_exprs to the specified file.
    EXEC str_expr;                            Executes the shell command given by str_expr.
                                              The exit status is available as numerical constant exitStatus.
    EXIT num_expr;                            Exits CrocoPat with the given exit status.
    { stmt ... }                              Executes the stmts in the given order.

rel_expr ::=                                  Relational Expression. The result is a relation.
    rel_var(term, ...)                        Atomic relational expression.
    term rel_var term                         Same as rel_var(term, term).
    ! rel_expr                                Negation (not).
    rel_expr & rel_expr                       Conjunction (and).
    rel_expr | rel_expr                       Disjunction (or).
    rel_expr -> rel_expr                      Implication (if). r1 -> r2 is equivalent to !(r1) | (r2).
    rel_expr <-> rel_expr                     Equivalence (if and only if).
                                              r1 <-> r2 is equivalent to (r1 -> r2) & (r2 -> r1).
    EX(attribute, ..., rel_expr)              Existential quantification of the attributes.
    FA(attribute, ..., rel_expr)              Universal quantification of the attributes.
    TC(rel_expr)                              Transitive closure.
                                              → rel_expr must have exactly two free attributes.
    TCFAST(rel_expr)                          Same as TC, but with an alternative algorithm (see Section 3).
    FALSE(term, ...)                          Empty relation.
    TRUE(term, ...)                           Relation containing all tuples of strings in the universe.
    @str_expr(term)                           Strings in the universe that match the regular expression str_expr.
    ~(term, term)                             Lexicographical order of all strings in the universe.³
    term ~ term                               Same as ~(term, term).
    ~(num_expr, num_expr)                     Numerical comparison. The result is either TRUE() or FALSE().³
    num_expr ~ num_expr                       Same as ~(num_expr, num_expr).
    rel_expr ~ rel_expr                       Relational comparison. The result is either TRUE() or FALSE().³
    (rel_expr)

¹ stmt ... denotes a sequence of one or more stmts.
² Context conditions are marked with →.
³ ~ can be =, !=, <=, <, >=, >.
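
To illustrate how several of these statements and relational expressions combine, here is a small
sketch that tests a graph for cycles. It is intended to be run without RSF input (option -e); the
names Arc, Reach, and OnCycle are illustrative assumptions.

```rml
Arc("a", "b");                     // shortcut for Arc("a", "b") := TRUE("a", "b")
Arc("b", "c");

Reach(x, y) := TC(Arc(x, y));      // TC requires exactly two free attributes
OnCycle(x) := EX(y, Reach(x, y) & =(x, y));

IF (EX(x, OnCycle(x))) {           // the condition has no free attributes
    PRINT "cyclic", ENDL;
} ELSE {
    PRINT "acyclic", ENDL;
}
```

The string literals "a", "b", and "c" appear on the left-hand side of assignments, so they belong to
the universe even though no RSF data is read.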
term ::=                                      Term.
    attribute                                 Attribute.
    _                                         Anonymous attribute. E.g. R(_) is equivalent to EX(x, R(x)).
    str_expr                                  String expression.

str_expr ::=                                  String Expression. The result is a string.
    str_literal                               String literal.
    str_var                                   String variable.
    STRING(num_expr)                          Converts the result of num_expr into a string.
    $ num_expr                                Command line argument. The first argument has the number 1.
                                              The constant argCount contains the number of arguments.
    str_expr + str_expr                       Concatenation.
    (str_expr)

num_expr ::=                                  Numerical Expression. The result is a floating point number.
    num_literal                               Numerical literal.
    num_var                                   Numerical variable.
    NUMBER(str_expr)                          Converts the result of str_expr into a number. Yields 0.0 if
                                              the result of str_expr is not the string representation of a number.
    #(rel_expr)                               Cardinality (number of elements) of the result of rel_expr.
    MIN(rel_expr), MAX(rel_expr),             Minimum, maximum, sum, and arithmetic mean of NUMBER(s)
    SUM(rel_expr), AVG(rel_expr)              over all strings s in the result of rel_expr.
                                              → rel_expr must have one free attribute, its result must be non-empty.
    num_expr + num_expr                       Addition.
    num_expr - num_expr                       Subtraction.
    num_expr * num_expr                       Multiplication.
    num_expr / num_expr                       Real division.
    num_expr DIV num_expr                     Integer division (truncating).
    num_expr MOD num_expr                     Modulo.
    num_expr ^ num_expr                       The first num_expr raised to the power given by the second num_expr.
    argCount                                  Number of command line arguments.
    exitStatus                                Exit status of the last executed EXEC statement.
    (num_expr)

print_expr ::=                                Print Expression.
    rel_expr                                  Prints the tuples in the result of rel_expr, one tuple per line.
    [str_expr] rel_expr                       Prints the result of str_expr before each tuple of rel_expr.
    str_expr                                  Prints the result of str_expr.
    num_expr                                  Prints the result of num_expr.
    ENDL                                      Prints a line break.
    RELINFO(rel_expr)                         Prints information about the BDD representation of rel_expr.



Levels of Precedence (from Low to High)

    ->, <->
    |
    &
    !
    =, !=, <=, <, >=, >
    +, - (binary)
    *, /, DIV, MOD
    ^
    - (unary)
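
The string and numerical expressions can be tried out with a sketch like the following, which is
intended to be run without RSF input (option -e). The relation Weight, its elements, and the variable
names are assumptions made up for illustration.

```rml
Weight("3");
Weight("4.5");
Weight("12");

n := #(Weight(w));                       // cardinality of the unary relation
PRINT "count: ", n, ENDL;
PRINT "min:   ", MIN(Weight(w)), ENDL;   // aggregates apply NUMBER to each string
PRINT "mean:  ", AVG(Weight(w)), ENDL;

label := "n=" + STRING(n);               // number to string, then concatenation
PRINT label, ENDL;
```

Here n is a numerical variable, label a string variable, and w an attribute; each type is fixed by the
first occurrence of the identifier.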
Free Attributes in Relational Expressions. Several context conditions of RML refer to the free
attributes in relational expressions. The number of free attributes in a relational expression equals the
arity of the resulting relation. Informally, the set of free attributes of an expression is the set of its
contained attributes that are not in the scope of a quantifier (i.e., EX or FA). Exceptions are the numerical
and relational comparisons, which have Boolean results and therefore no free attributes. Formally, the
function f that assigns to each relational expression the set of its free attributes is inductively defined
as follows:

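\begin{align*}
f(\mathit{rel\_var}\texttt{(}\mathit{term}_1\texttt{,} \ldots\texttt{,} \mathit{term}_n\texttt{)}) &= \{\, \mathit{term}_i \mid \mathit{term}_i \text{ is an attribute} \,\} \\
f(\texttt{!}\ \mathit{rel\_expr}) &= f(\mathit{rel\_expr}) \\
f(\mathit{rel\_expr}_1\ \texttt{\&}\ \mathit{rel\_expr}_2) &= f(\mathit{rel\_expr}_1) \cup f(\mathit{rel\_expr}_2) \\
f(\mathit{rel\_expr}_1\ \texttt{|}\ \mathit{rel\_expr}_2) &= f(\mathit{rel\_expr}_1) \cup f(\mathit{rel\_expr}_2) \\
f(\mathit{rel\_expr}_1\ \texttt{->}\ \mathit{rel\_expr}_2) &= f(\mathit{rel\_expr}_1) \cup f(\mathit{rel\_expr}_2) \\
f(\mathit{rel\_expr}_1\ \texttt{<->}\ \mathit{rel\_expr}_2) &= f(\mathit{rel\_expr}_1) \cup f(\mathit{rel\_expr}_2) \\
f(\texttt{EX(}\mathit{attribute}\texttt{,}\ \mathit{rel\_expr}\texttt{)}) &= f(\mathit{rel\_expr}) \setminus \{\mathit{attribute}\} \\
f(\texttt{FA(}\mathit{attribute}\texttt{,}\ \mathit{rel\_expr}\texttt{)}) &= f(\mathit{rel\_expr}) \setminus \{\mathit{attribute}\} \\
f(\texttt{TC(}\mathit{rel\_expr}\texttt{)}) &= f(\mathit{rel\_expr}) \\
f(\texttt{TCFAST(}\mathit{rel\_expr}\texttt{)}) &= f(\mathit{rel\_expr}) \\
f(\texttt{(}\mathit{num\_expr}_1 \sim \mathit{num\_expr}_2\texttt{)}) &= \emptyset \\
f(\texttt{(}\mathit{rel\_expr}_1 \sim \mathit{rel\_expr}_2\texttt{)}) &= \emptyset \\
f(\texttt{(}\mathit{rel\_expr}\texttt{)}) &= f(\mathit{rel\_expr})
\end{align*}

As before, ~ can be =, !=, <=, <, >=, or >. The relational constants FALSE, TRUE, and @str_expr are
equivalent to rel_var with respect to the definition of free attributes.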
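
The definition can be traced on a small example. The sketch below assumes a binary relation Parent
in the RSF input; the names Parent and Grandparent are illustrative.

```rml
// f(Parent(x,y) & Parent(y,z)) = {x, y, z}; EX removes y, leaving {x, z},
// which equals the set of attributes on the left-hand side.
Grandparent(x, z) := EX(y, Parent(x, y) & Parent(y, z));

// A comparison has no free attributes, so it can serve as an IF condition.
IF (#(Grandparent(x, z)) > 0) {
    PRINT Grandparent(x, z);
}
```

Without the quantifier, the right-hand side would have the free attributes {x, y, z}, and the assignment
to Grandparent(x, z) would violate the context condition for relational assignments.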



Regular Expressions. In the relational expression @str_expr(term), the result of str_expr can be any
POSIX extended regular expression [IEE01, Section 9.4]. A full description is beyond the scope of this
manual; we only give a short overview.

Most characters in a regular expression only match themselves. The following special characters match
themselves only when they are preceded by a backslash (\), and otherwise have special meanings:

    .       Matches any single character.
    [ ]     Matches any single character contained within the brackets.
    [^ ]    Matches any single character not contained within the brackets.
    ^       Matches the start of the string.
    $       Matches the end of the string.
    {x,y}   Matches the last character (or regular expression enclosed by parentheses)
            at least x and at most y times.
    +       Matches the last character (or regular expression enclosed by parentheses) one or more times.
    *       Matches the last character (or regular expression enclosed by parentheses) zero or more times.
    ?       Matches the last character (or regular expression enclosed by parentheses) zero or one times.
    |       Matches either the expression before or the expression after the operator.

Regular expressions can be grouped by enclosing them with parentheses ((...)).
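
As a brief sketch of how these patterns might be used in RML, the following fragment selects strings
by prefix and by suffix. The unary relation Package and the result names are assumptions for
illustration. The literal dot is written as the bracket expression [.] so that the example does not
depend on how backslashes in string literals are treated.

```rml
// Package is a hypothetical unary relation from the RSF input.
JavaPkg(x) := Package(x) & @"^java[.]"(x);   // elements of Package starting with "java."
TestLike(x) := @"(Test|Mock)$"(x);           // universe strings ending with Test or Mock

PRINT JavaPkg(x);
PRINT TestLike(x);
```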



6.3 Formal Semantics 

This subsection formally defines the semantics of the part of RML that deals with relations, namely of 
relational expressions and the relational assignment statement. 

The universe $\mathcal{U}$ is the finite set of all string literals that appear in the input RSF file, or on
the left side of a relational assignment. The finite set of attributes of the RML program is denoted by $X$
($\mathcal{U} \cap X = \emptyset$). An attribute assignment is a total function
$v : X \cup \mathcal{U} \to \mathcal{U}$ which maps each attribute to its value and (for notational
convenience) each string literal to itself. The set of all attribute assignments is denoted by $Val(X)$.
The finite set of relation variables in the RML program is denoted by $\mathcal{R}$. A relation assignment
is a total function $s : \mathcal{R} \to \bigcup_{n \in \mathbb{N}} 2^{\mathcal{U}^n}$, which maps each
relation variable to a relation of arbitrary arity. The set of all relation assignments is denoted by
$Rel(\mathcal{R})$.

The semantics of relational expressions and statements are given by the following interpretation
functions:

\begin{align*}
[\![\cdot]\!]_e &: \mathit{rel\_exprs} \to \bigl(Rel(\mathcal{R}) \to 2^{Val(X)}\bigr) \\
[\![\cdot]\!]_s &: \mathit{stmts} \to \bigl(Rel(\mathcal{R}) \to Rel(\mathcal{R})\bigr)
\end{align*}

So we define the semantics of an expression as the set of attribute assignments that satisfy the expression,
and the semantics of a statement as a transformation of the relation assignment. The interpretation
functions are defined inductively in Figure 2.



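\begin{align*}
[\![\mathit{rel\_var}\texttt{(}\mathit{term}_1\texttt{,} \ldots\texttt{,} \mathit{term}_n\texttt{)}]\!]_e(s) &= \{\, v \in Val(X) \mid (v(\mathit{term}_1), \ldots, v(\mathit{term}_n)) \in s(\mathit{rel\_var}) \,\} \\
[\![\texttt{!}\ \mathit{rel\_expr}]\!]_e(s) &= Val(X) \setminus [\![\mathit{rel\_expr}]\!]_e(s) \\
[\![\mathit{rel\_expr}_1\ \texttt{\&}\ \mathit{rel\_expr}_2]\!]_e(s) &= [\![\mathit{rel\_expr}_1]\!]_e(s) \cap [\![\mathit{rel\_expr}_2]\!]_e(s) \\
[\![\mathit{rel\_expr}_1\ \texttt{|}\ \mathit{rel\_expr}_2]\!]_e(s) &= [\![\mathit{rel\_expr}_1]\!]_e(s) \cup [\![\mathit{rel\_expr}_2]\!]_e(s) \\
[\![\mathit{rel\_expr}_1\ \texttt{->}\ \mathit{rel\_expr}_2]\!]_e(s) &= [\![\texttt{!(}\mathit{rel\_expr}_1\texttt{) | (}\mathit{rel\_expr}_2\texttt{)}]\!]_e(s) \\
[\![\mathit{rel\_expr}_1\ \texttt{<->}\ \mathit{rel\_expr}_2]\!]_e(s) &= [\![\texttt{(}\mathit{rel\_expr}_1\ \texttt{->}\ \mathit{rel\_expr}_2\texttt{) \& (}\mathit{rel\_expr}_2\ \texttt{->}\ \mathit{rel\_expr}_1\texttt{)}]\!]_e(s) \\
[\![\texttt{EX(}\mathit{attribute}\texttt{,}\ \mathit{rel\_expr}\texttt{)}]\!]_e(s) &= \{\, v \in Val(X) \mid \exists v' \in [\![\mathit{rel\_expr}]\!]_e(s)\ \forall x \in X \setminus \{\mathit{attribute}\} : v(x) = v'(x) \,\} \\
[\![\texttt{FA(}\mathit{attribute}\texttt{,}\ \mathit{rel\_expr}\texttt{)}]\!]_e(s) &= [\![\texttt{!EX(}\mathit{attribute}\texttt{, !(}\mathit{rel\_expr}\texttt{))}]\!]_e(s) \\
[\![\texttt{TC(}\mathit{rel\_var}\texttt{(}\mathit{attribute}_1\texttt{,}\ \mathit{attribute}_2\texttt{))}]\!]_e(s) &= \mu\, [\![\mathit{rel\_var}\texttt{(}\mathit{attribute}_1\texttt{,}\ \mathit{attribute}_2\texttt{)} \\
&\qquad \texttt{| EX(}\mathit{attribute}_3\texttt{,}\ \mathit{rel\_var}\texttt{(}\mathit{attribute}_1\texttt{,}\ \mathit{attribute}_3\texttt{)} \\
&\qquad\quad \texttt{\& (}\mathit{rel\_var}\texttt{(}\mathit{attribute}_3\texttt{,}\ \mathit{attribute}_2\texttt{)))}]\!]_e(s) \\
[\![\texttt{TCFAST(}\mathit{rel\_var}\texttt{(}\mathit{attribute}_1\texttt{,}\ \mathit{attribute}_2\texttt{))}]\!]_e(s) &= [\![\texttt{TC(}\mathit{rel\_var}\texttt{(}\mathit{attribute}_1\texttt{,}\ \mathit{attribute}_2\texttt{))}]\!]_e(s) \\
[\![\texttt{TRUE(}\mathit{term}_1\texttt{,} \ldots\texttt{,} \mathit{term}_n\texttt{)}]\!]_e(s) &= Val(X) \\
[\![\texttt{FALSE(}\mathit{term}_1\texttt{,} \ldots\texttt{,} \mathit{term}_n\texttt{)}]\!]_e(s) &= \emptyset \\
[\![\texttt{=(}\mathit{term}_1\texttt{,}\ \mathit{term}_2\texttt{)}]\!]_e(s) &= \{\, v \in Val(X) \mid v(\mathit{term}_1) = v(\mathit{term}_2) \,\} \\
[\![\texttt{<(}\mathit{term}_1\texttt{,}\ \mathit{term}_2\texttt{)}]\!]_e(s) &= \{\, v \in Val(X) \mid v(\mathit{term}_1) <_{\text{lexicographically}} v(\mathit{term}_2) \,\} \\
[\![\texttt{=(}\mathit{rel\_expr}_1\texttt{,}\ \mathit{rel\_expr}_2\texttt{)}]\!]_e(s) &= \begin{cases} Val(X), & \text{if } [\![\mathit{rel\_expr}_1]\!]_e(s) = [\![\mathit{rel\_expr}_2]\!]_e(s) \\ \emptyset, & \text{otherwise} \end{cases} \\
[\![\texttt{<(}\mathit{rel\_expr}_1\texttt{,}\ \mathit{rel\_expr}_2\texttt{)}]\!]_e(s) &= \begin{cases} Val(X), & \text{if } [\![\mathit{rel\_expr}_1]\!]_e(s) \subset [\![\mathit{rel\_expr}_2]\!]_e(s) \\ \emptyset, & \text{otherwise} \end{cases} \\
[\![\texttt{(}\mathit{rel\_expr}\texttt{)}]\!]_e(s) &= [\![\mathit{rel\_expr}]\!]_e(s) \\
[\![\mathit{rel\_var}\texttt{(}\mathit{term}_1\texttt{,} \ldots\texttt{,} \mathit{term}_n\texttt{) := }\mathit{rel\_expr}]\!]_s(s)(r) &= \begin{cases} s(r), & \text{if } r \neq \mathit{rel\_var} \\ \{\, (v(\mathit{term}_1), \ldots, v(\mathit{term}_n)) \mid v \in [\![\mathit{rel\_expr}]\!]_e(s) \,\} & \\ \quad \cup\ \{\, t \in s(r) \mid \exists i : \mathit{term}_i \in \mathcal{U} \wedge \mathit{term}_i \neq t_i \,\}, & \text{if } r = \mathit{rel\_var} \end{cases}
\end{align*}

with $\mathit{attribute}, \mathit{attribute}_1, \mathit{attribute}_2, \mathit{attribute}_3 \in X$,
$\mathit{term}_1, \ldots, \mathit{term}_n \in X \cup \mathcal{U}$, and $\mathit{rel\_var} \in \mathcal{R}$.
The symbol $\mu$ denotes the least fixed point.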



Fig. 2. RML semantics 
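
The last rule of the figure, the relational assignment with a string literal on the left-hand side,
is perhaps easiest to follow on a concrete case. The sketch below is intended to be run without RSF
input (option -e); the relations R and S and their elements are made up for illustration.

```rml
R("a", "1");
R("b", "2");
S("3");

// Only tuples whose first element equals the literal "a" are replaced.
// The tuple ("b", "2") differs from the literal at position 1 and is kept.
R("a", y) := S(y);
PRINT R(x, y);
```

Reading the definition literally, R should contain the tuples ("a", "3") and ("b", "2") after the last
assignment: the first set of the union contributes ("a", "3"), and the second set retains ("b", "2").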